Credit Card Fraud Detection

Packages

Data

Data Overview

Exploratory Data Analysis

Fraud and Valid Transactions in Data

The dataset contains 492 frauds and 284,315 valid transactions (284,807 in total). It is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.

We can see that most of the transactions are valid, which reflects a real-world dataset. If we use this dataframe as-is as the base for our predictive models and analysis, our algorithms will probably overfit, since they will "assume" that most transactions are not fraudulent.
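A minimal sketch of checking the class balance. The real notebook would read `creditcard.csv`; here a toy frame with the same label proportions stands in for it (an assumption), so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: 284,807 rows, 492 of them frauds.
labels = np.zeros(284807, dtype=int)
labels[:492] = 1
df = pd.DataFrame({"Class": labels})

counts = df["Class"].value_counts()
fraud_ratio = counts[1] / len(df)
print(counts)
print(f"Fraud ratio: {fraud_ratio:.3%}")  # roughly 0.173%
```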

Variable Summary

Of all the features, only Time, Amount, and Class (fraud or not fraud) carry their original meaning. The other 28 columns were transformed using what seems to be PCA (a dimensionality-reduction technique) in order to protect user identities.
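A sketch of the kind of transform the V1–V28 columns likely went through. The actual transform was applied by the dataset authors before release, so the raw features below are purely hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical raw, identifying features (assumption -- not the real data).
rng = np.random.default_rng(0)
raw = rng.normal(size=(100, 5))

# PCA rotates the data into uncorrelated components, hiding the
# original feature meanings while preserving most of the variance.
components = PCA(n_components=5).fit_transform(raw)
print(components.shape)
```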

Histograms of all Variables

Looking at the first graph, for the Time variable, we can see two peaks.

These correspond to the time of day: the peaks are the daytime, when most people make transactions, and the trough is the night, when most people are asleep. Because the data covers credit card transactions over only two days, there are two daytime peaks and one overnight trough.
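Time is recorded as seconds elapsed since the first transaction over the two days, so converting it to hour-of-day makes the day/night cycle behind the two peaks easy to see. The toy Time values below are assumptions, not taken from the real data:

```python
import pandas as pd

# Toy Time values in seconds since the first transaction (assumption).
time_seconds = pd.Series([0, 3600, 90000, 172000])

# Integer hours, wrapped onto a 24-hour clock.
hour_of_day = (time_seconds // 3600) % 24
print(hour_of_day.tolist())
```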

Scatter Plots

Correlation Matrix

Find the features that are most highly correlated with the Class label.

We can see there are many features in this dataset, but having many features does not always help the model; instead, we may run into overfitting issues and get worse results.
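One way to rank features is by absolute correlation with the Class label. A small synthetic frame stands in for the real PCA columns here (an assumption), with one informative column and one noise column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
cls = rng.integers(0, 2, n)
df = pd.DataFrame({
    "V1": cls + rng.normal(0, 0.5, n),  # strongly related to Class
    "V2": rng.normal(0, 1, n),          # pure noise
    "Class": cls,
})

# Absolute correlation of each feature with the label, strongest first.
corr = df.corr()["Class"].drop("Class").abs().sort_values(ascending=False)
print(corr)
```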

**Imbalanced Data**

Model Building

Logistic Regression

KNN Classifier

Random Forest Classifier

Gaussian Naive Bayes

XGBoost Classifier
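The classifiers listed above can be fitted and scored in a single loop. A minimal sketch on synthetic imbalanced data (an assumption; the real notebook uses the creditcard.csv features), with XGBoost omitted so the snippet needs only scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with roughly 2% positives, mimicking the class imbalance.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.98], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GaussianNB": GaussianNB(),
}
accs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    accs[name] = model.score(X_te, y_te)
    print(name, f"{accs[name]:.3f}")
```

Note that on data this imbalanced, plain accuracy is misleading (predicting "valid" everywhere already scores about 0.98), which is why the evaluation section below also looks at ROC and precision-recall curves.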

Model Evaluation

Model performance metrics

Compare Model Metrics

Receiver Operating Characteristic (ROC) - Curves

Precision Recall Curves
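A sketch of the two summary metrics behind these curves, using hand-made scores (an assumption). On heavily imbalanced data, average precision (the area under the precision-recall curve) is usually more informative than ROC AUC:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy labels and classifier scores: 2 frauds among 10 transactions.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.3, 0.9, 0.35])

auc = roc_auc_score(y_true, y_score)
ap = average_precision_score(y_true, y_score)
print("ROC AUC:", auc)            # fraction of (fraud, valid) pairs ranked correctly
print("Average precision:", ap)   # area under the precision-recall curve
```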

**Unsupervised Outlier Detection**

Now that we have processed our data, we can begin deploying our machine learning algorithms. We will use the following techniques:

Local Outlier Factor (LOF)

The anomaly score of each sample is called the Local Outlier Factor. It measures the local deviation of density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood.
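A minimal sketch of LOF on toy data (a dense cluster plus one far-away point; the data is an assumption, the real notebook would use the transaction features):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# One tight cluster of 50 points plus a single distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier
print("Outlier indices:", np.where(labels == -1)[0])
print("LOF score of the far point:", -lof.negative_outlier_factor_[-1])
```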

Isolation Forest Algorithm

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.

This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.

Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
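The same toy setup works for Isolation Forest: the far-away point is isolated in few splits, so it gets a short average path length and is flagged as an anomaly. The data below is an assumption, not the real transactions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# One tight cluster of 100 points plus a single distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)), [[8.0, 8.0]]])

iso = IsolationForest(random_state=0).fit(X)
labels = iso.predict(X)  # -1 = anomaly, 1 = normal
print("Anomaly indices:", np.where(labels == -1)[0])
```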

Model Building and Performance

Confusion Matrix

Compare Models
